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Abstract 

High-dimensional time-series data are becoming increasingly abundant across a wide va¬ 
riety of domains, spanning economics, neuroscience, particle physics, and cosmology. Fitting 
statistical models to such data, to enable parameter estimation and time-series prediction, is 
an important computational primitive. Existing methods, however, are unable to cope with the 
high-dimensional nature of these problems, due to both computational and statistical reasons. 
We mitigate both kinds of issues via proposing an M-estimator for Reduced-rank System IDen- 
tification (mr. sid). A combination of low-rank approximations, ii and 4 penalties, and some 
numerical linear algebra tricks, yields an estimator that is computationally efficient and numer¬ 
ically stable. Simulations and real data examples demonstrate the utility of this approach in 

a variety of problems. In particular, we demonstrate that mr. sid can estimate spatial filters, 
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connectivity graphs, and time-courses from native resolution functional magnetic resonance 
imaging data. Other applications and extensions are immediately available, as our approach is 
a generalization of the classical Kalman Filter-Smoother Expectation-Maximization algorithm. 

keywords: state-space model, parameter estimation, sparsity, imaging processing, fMRI 


1 Introduction 


High-dimensional time-series data are becoming increasingly abundant across a wide va¬ 
riety of domains, spanning economics ( [Johansen] 1988| , neuroscience ( jFriston et~arj |2003| (, 
and cosmology ( jXie et al.[ [2013| . Fitting statistical models to such data, to enable parameter 
estimation and time-series prediction, is an important computational primitive. Linear dynami¬ 
cal systems (LDS) models are amongst the most popular and powerful, because of their intu¬ 
itive nature and ease of implementation ( |Kalman[ [1963| |. The famous Kalman Filter Smoother 
is one of the most popular and powerful methods for time-series prediction in an LDS, given 
known parameters ( |Kalman[ [1960| (. In practive, however, for many LDSs, the parameters are 
unknown, and must be estimated, a process often called system identification in this domain 
( |Ljung[ |1998| (. To date, there does not exist, to our knowledge, a methodology that provides 
parameter estimates and predictions from ultrahigh-dimensional time-series data (for example, 

p > 10 , 000 ). 

The challenges associated high-dimensional time-series estimation and prediction are mul¬ 
tifold. First, naively such models include dense p x p matrices, which often are too large even 
to store, and much too large to invert in memory. For large sparse matrices, recently, several 
efforts to invert them using a series of computational tricks are promising, though still extremely 
computationally expensive ( jHsieh et ^ [2013| ; [Banerjee et al.i|2013| |. Second, estimators be¬ 
have poorly, due to numerical instability issues. Reduced rank LDS models reduce the number 
of latent dimensions, and therefore partially address this problem ( jCHEN et^|1989| . However, 
without further constraints, the dimensionality of the latent states must be so small as to sig¬ 
nificantly decrease the predictive capacity of the resulting model. Third, even after addressing 
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these problems, the time to compute all the necessary quantities can be overly burdensome. 
Distributed memory implementations, such as using Spark, might be above to overcome this 
problem, but it would at additional costs and set-up burden, as it would require a Spark cluster 
dZaharia et al.i|2010 . 

We address all three of these issues with our M-estimator for Reduced-rank System IDenti- 
fication (mr. sid). By assuming the dimensionality of the latent state space is small (reduced 
rank), relative to the ambient, or observed, space dimensionality, we can significantly improve 
computational tractability and estimation accuracy. By further penalizing the estimators, with 
and/or 4 penalties, utilizing prior knowledge on the structure of the parameters, we gain further 
estimation accuracy in this high-dimensional but relatively low-sample size regime. Finally, by 
employing several numerical linear algebra tricks, we can bring the computational burden down 
significantly. 

These three techniques together enable us to obtain highly accurate estimates in a va¬ 
riety of simulation settings, mr. sid is, in fact, a generalization of the now classic Baum- 
Welch expectation maximization algorithm, commonly used for system identification in much 
lower dimensional linear dynamical systems ( |Rabiner[ 1989) . We numerically show that the 
hyper-parameters can be selected minimizing prediction error on held-out data. Finally, we use 
Mr. Sid to estimate functional connectomes from motor cortex, mr. sid enables us to esti¬ 
mate the regions, rather than imposing some prior parcellation on the data, as well as estimate 
sparse connectivity between regions, mr. sid reliably estimates these connectomes, as well 
as predicts the held-out time-series data. To our knowledge, this is the first time anybody has 
been able to estimate partitions and functional connectomes directly from the high-dimensional 
data, with a single unified approach. To enable extensions, generalizations, and additional ap¬ 
plications, the functions and code for generating each of the figures is freely available from 
https: //github.com/shachen/PLDS/ (referred to as the PLDS Git Repo in following sec¬ 
tions). 
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2 The Model 


In statistical data analysis, one often encounter some observed signals and also some unob¬ 
served or latent variables, which we denote asY = {y^,. and X = (cci,..., xt) respec¬ 
tively. By the Bayesian rule, the joint probability of Y and X is P{X,Y) = P{Y\X)P{X). The 
conditional distribution P{Y\X) and prior can both be represented as a product of marginals: 

T 

P{Y\X) = \[P{y,\y,,,_,,xo.,), 

^ ( 1 ) 

P{X) = P{x,)^P{xt\xo..t-i). 

t=i 

The generic time-invariant state-space model makes the following simplifying assumptions 


P{yt\yo-.t-i.xo,t) ^ P{yt\xt), 

( 2 ) 

P{xt\xo.,t-i) ~ P{xt\xt-i). 

Finally, a lineary dynamical system assumes that both of the above terms are linear Gaus¬ 
sian functions, which, when written as an iterative random process, yields the standard matrix 
update rules: 


= Axt + r;, Y[ ~ iV( 0 , g), ~ 

( 3 ) 

y^ = Cxt + \t, Vt~7V(0,/?), 

where A is the ci x d state transition matrix and C is the p x d generative matrix. xt\s a d x 1 
vector and ?/gs a p x 1 vector. The output noise covariance i? is p x p, while the state noise 
covariance Q\sd x d. Initial state mean ttq is ci x 1 and covariance Vo is d x d. 

The model can be thought as the continuous version of the hidden Markov model (HMM), 
where the columns of C stands for the hidden states. The difference is that in this model, what 
one observe at each time point is not a single state, but a linear combination of multiple states. 
Xt is the weights in the linear combination. A matrix is the analogy of the state transition matrix, 
which describe how the weights xt evolve over time. Another difference is that LDS contains 
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two white noise terms, which are captured by the Q and R matrices. 

Without applying further constraints, the model itself is unidentifiable. Supplemental con¬ 
straints are are thus introduced to address both identifiability and utility. Three basic constraints 
are required to make the model identifiable: 

Constraint 1 : Q is the identity matrix 

Constraint 2: the ordering of the columns of C is fixed based on their norms 

Constraint 3: f/o = 0 

Note that the first two constraints follow directly from Roweis and Ghahramani (1999) ( [Roweis 
T9^ . 

The logic for Constraint 1 is as follows. Since Q is a covariance matrix, it is symmetric 
and positive semidefinite and thus can be expressed in the form EAE’^ where E is a rotation 
matrix of eigenvectors and A is a diagonal matrix of eigenvalues. Thus, for any model where 
Q is not the identity matrix, one can generate an equivalent model using a new state vector 

= A~^/^E'^x with = {A~^/‘^E'^)A{EA^/‘^) and such that the new covariance 

of x^ is the identity matrix, i.e., = I. Thus one can constrain Q = 1 without loss of generality. 

For Constraint 2, the components of the state vector can be arbitrarily reordered; this cor¬ 
responds to swapping the columns of C and A. Therefore,the order of the columns of matrix 
C must be fixed. We follow Roweis and Ghahramani and choose the order by decreasing the 
norms of columns of C. 

Additionally, Vq is set to zero, meaning the starting state xq = ttq is an unknown constant 
instead of a random variable, since there is only a single chain of time series in the neuroimaging 
application. To estimate Vq accurately, multiple series of observations are required. 

The following three constraints are further applied to achieve a more useful model. 

Constraint 4: i? is a diagonal matrix 

Constraint 5: A is sparse 

Constraint 6: C has smooth columns 


and Ghahramani 
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Consider the case where the observed data are high dimensional and the R matrix is very 
large. One can not accurately estimate the many free parameters in R with limited observed 
data. Therefore some constraints on R will help with inferential accuracy, by virtue of signifi¬ 
cantly reducing variance while not adding too much bias. In the simplest case, R is set to an 
identity matrix or its multiple. More generally, one can also constrain matrix R to be diagonal. 
In the static model with no temporal dynamics, a diagonal R is equivalent to the generic Factor 
Analysis method, while multiples of the identity R matrix lead to Principal Component Analysis 
(PCA) dRoweis and GhahramaniHl999| ). 

The A matrix is the transition matrix of the hidden states. In our application, it is a central 
construct of interest representing a so-called connectivity graph. In many applications, it is 
desirable for this graph to be sparse. In this work, an ii penalty term on A is used to impose 
sparsity on the connectivity graph. 

Similarly, for many applications, one wants the columns of C to be smooth. For example, in 
neuroimaging data analysis, each column of C can be a signal in the brain. Therefore having 
the signal spatially smooth can help extract meaningful information from the noisy neuroimaging 
data. In this context, an £2 penalty term on C is used to enforce smoothness. 

With all those constraints, the model becomes: 


Xt+i=Axt + Y[, a;o = 7ro 

( 4 ) 

yt = Cxt + \t, Vt~A^(0,i?). 

where A is a sparse matrix and C has smooth columns. Let 9 = {A^C.R^tto} represents 
all unknown parameters and P{X,Y) be the likelihood for a generic LDS model, then combing 
model and the constraints on A and C lead us to an optimization problem 


9 = argmin {-logPe{X,Y) + Ai||v4||i AsUC'H^} 
e 


( 5 ) 


where Ai and A 2 are tuning parameters and ||. ||p represents the p-norm of a vector. Equivalently, 
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this optimization problem can be written as 


minimize {-logPe(X,l^)} 

subject to: a||^||i + (1 - «)l|C'll2 < t fo'' some t; 

^ £ ^dxd, C G Cpxd, R £ Rpxp, TTo £ T^dxl- 


( 6 ) 


where a = ^^dxd and Cp^d are matrix spaces of x and px d dimensional respectively. 
TZpxp is the p X p diagonal matrix space and vr^xi is the d dimensional vector space. 


3 Parameter Estimation 

The motivating application requires solving optimization problem]^ given only an observed 
sequence (or multiple sequences in some applications) of outputs Y = {y^,... ,yj,), find the 
parameters 6 = {A, C, R, tto} that maximize the likelihood of the observed data. 

Parameter estimation for LDS has been investigated extensively by researchers from control 
theory, signal processing, machine learning and statistics. For example, in machine learning, 
exact and variational learning algorithms are developed for general Bayesian networks. In con¬ 
trol theory, the corresponding area of study is known as system identification, which identifies 
parameters in continuous state models. 

Specifically, one way to search for the maximum likelihood solution is through iterative tech¬ 
niques such as expectation maximization (EM) ( [Shumway and Stoff^|1982| t. The detailed EM 
steps for a generic LDS can be found in Zoubin and Geoffrey (1996) dGhahramani and Hinton 


1996| . An alternative approach is to use subspace identification methods such as N4SID and 


PCA-ID to compute an asymptotically unbiased solution in closed form ( [Van Overschee and 
|De Moor[[1994{|Doretto et al.[ 2003| t. In practice, determining an initial solution with subspace 
identification and then refining the solution with EM is an effective approach ( |Boots[poots| . 

However, the above solutions can not be directly applied to optimization problem [5] due to 
the introduced penalty terms. We therefore developed a novel algorithm called M-estimation for 
Reduced rank System IDentification (mr. Sid), as detailed in the following. 
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By the chain rule, the likelihood in model is 


T 


T 


P{X,Y) = P{Y\X)P{X) = P{xo)l[P{x,\xt-,)l[P{y,\xt) = l[P{x,\xt-i)l[P{y,\xt)U{xo) 


t=i 


t=i 


t=i 


t=i 


where l,ro(a;o) is the indicator function and conditional likelihoods are 

= (27r)-2|i?|-5 exp - Cxtl'p-^lVt - Cxt^ 

P{xt\xt-i) = (27r)"^ exp - Axt_i\'[xt - Aa;t_i]|. 

Then the log-likelihood, after dropping a constant, is just a sum of quadratic terms 


logP(X,l^) = - ^ - Cx,]'R-^[y, - Cx,]) - |log|P| 

t=i 

^ ^ It T 

“ - Axt-i]) - -log|I| + log(l^o(a;o))- 

t=i 

Replace logP(X,l^) with equation]^ optimization problem]^ is 


( 7 ) 


e = argminj ^ {^[y^ - Cxt]^R ^[y^ - Cxt]) - ^log|P| 

® t=i 

^ ^ It T 

+ ^ {^[xt - Axt-i]'[xt - Axt-i]) - -log|I| 

t=i 


'^ill^lli + '^2||C'||; 


log(l^o(*o)) 


( 8 ) 


Denote the target function in the curly braces as $(6',1^,X), then $ can be optimized with 
an Expectation-Maximization (EM) algorithm. 


3.1 E Step 

The E step of EM requires computing the expected log likelihood, T = P[logP(X,l^|l^)]. 
This quantity depends on three expectations: E[xt\Y], E[xtx]\Y] and E[xtx]_^\Y]. We denote 
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their finite sample estimators by: 


xt = E[^t\Y], Pt = E[xtx]\Yl = E[xtx]_,\Y]. 


(9) 


Expectations are estimated with a Kalman filter/smoother, which is detailed in the Ap¬ 
pendix on the PLDS Git Repo. Notice that all expectations are taken with respect to the current 
estimations of parameters. 

3.2 M Step 

The parameters are 6 = {A, C, R, tto}. Each of them is estimated by taking the correspond¬ 
ing partial derivatives of ^{6,Y,x), setting to zero and solving. 

Denote estimations from previous step as 6'°''^ = and current estima¬ 
tions as ttJJ®'"}. Estimation for output noise covariance R has closed 

form solution, 



t=i 



( 10 ) 


t=i 



t=i 


In the bottom line, diag denotes to only extract the diagonal terms of the matrix R, as we 
constrain R to be diagonal in Constraint 4. 

Estimation for initial state also has closed form. The relevant term log(l^o(a;o)) is minimized 
only when vr^J®™ = xq. 

Estimation for transition matrix C also has closed form solution, and the solution can be 
derived by rearranging the terms properly. Terms relevant to C in equation are 


MC-,X,Y) = [\\Vt - Cx,]^R-'[v, - Cx,u + A^IICIh. 


( 11 ) 


t=l 


9 



\nh,{C;X,Y), C is a matrix, we vectorized it to ease optimization and notation. Here we 
follow the methods of |Turlach et al.| < |200^ . Without loss of generality, assume R is the identity 
matrix in equation [Tj] otherwise, one can always write equation pM] as 


T 

- R-'^Cxt] \ + X2\\R-'^C\\ 
i=i X 


Let Y' = {yn, • • •, 2 /ri, 1 / 12 , • ■ ■, 1 /T 2 , • • •, Vip, • • •, vxpY be a Tp x 1 vector from rearranging Y. In 
addition, let 




X' = 




X 


pTxpd 


Finally, vectorize as 


,old _ (/^o\6 ^old ^old ^old ^old ^oldN 

— , . . . , (-'Id , <-^21 ) ■ ■ ■ ) '^2d ) 5 • • • ) '^pd ) 


pi 


''pd 


( 12 ) 


where Cij is the element at row i and column j of C. With these new notations, the equation [TT| 
is equivalent to 


hAC;X,Y) = \\Y' -X'cWl 



(13) 


With the Tikhonov regularization ( [Tikhonovl 1943| , equation [T^ has closed form solution 


cnew ^ ^x'^x' + A2l)"^X'V' 

^new ^ Rearrange c"®'" by equation 

Now let’s look at parameter A. Terms involving A in equation]^ are, 

^ 1 

fx,(A;X,Y) ={-[xt - Axt-i]"[xt - Axt-i]) +Ai||Al||i. 


(14) 


(15) 
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Similar to what we have done to C, equation [i^ is equivalent to 


/A,(A;X,Y) = ||z-Za||2 + Ai||a| 


( 16 ) 


where z is a Trf x 1 vector from rearranging X and Z is a block diagonal matrix with diagonal 
component = (ccq, ..., Unfortunately, equation [^does not have closed form solution 

due to the term. 

Though not having a closed form solution, fx^{A]X,Y) can be solved numerically with a 
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) ( [Beck and Teboulle| |2009[ ). FISTA is 
an accelerated version of the Iterative Shrinkage-Threshholding Algorithm (ISTA) ( [Daubechies 
et al.||2004| ). ISTA is linearly convergent while FISTA is quadratic convergent. Steps of a general 
FISTA algorithm can be found in the Appendix on the PLDS Git Repo. 

FISTA requires calculating the Lipschitz constant L for Vg(z) = Z^(Za - z), where g(z) = 
llZ^a - z\\l. Denote ||Z|| as the induced norm of matrix Z, then L is 


L = sup 


\Z\Zx-Z 

\\x-y\\ 


l|Z Z^ll . II^T 

= sup- — - ^ ||Z I 

x^O II ^11 


|Z|| = ||^ ||||Z||. 


With FISTA and L, matrix A can be updated: 


^new_ FISTA(||Z'a°'^-z||^, A 


(17) 


3.3 Initialization 

R matrix is initialized as the identify matrix, while tto as 0 vector. For A and C, denote 
Y = [yi, ■ ■ ■ ^Yt], a p X T matrix, then the singular value decomposition (SVD) of is = 
UDV^ ^ Vpxd^dxd'VlycT = ^pxdXdxT, where Upxd is the first d columns of U and B^xd is the 
upper left block of D. This notation also applies to ^ 'S then initialzed as Upxd, while 

the columns of XdxT are used as input for a vector autoregressive (VAR) model to estimate the 
initial value for A. 
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Table 1: The Complete EM Algorithm 


Algorithm EM Algorithm for mr. sid 


M Step 

1 . i^new ^ J ^ - C°^^xty[) L as in equation 


t=i 
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2 . Tir = *0 

3. Update as in equation [Ti] 

4. Update A®®™ with FISTA, as in equation piT] 

5. Stop when difference between estimations from this step and previous step 
is less than tolerance or maximum number of iterations reached. 

E Step 

0. Initialize 0 = {A, C, R, ttq} as in Section 


3.3 


if first loop 


1. Update the expectations in[^with the Kairnan filter smoother 


3.4 The Complete EM 


The complete EM algorithm for mr. sid is addressed in Table [i] Notice that all the terms 
involving X in the M-step are approximated with the conditional expectations calculated in E- 
step. 


3.5 Improving Computational Efficiency 

The major factors that affect the efficiency and scalability of the above EM algorithm involve 
the storage and computations of covariance matrix R. The following computational techniques 
are utilized to make the code highly efficient and scalable. 

First, a sparse matrix is used to represent R. When dimension p gets higher, the size of R 
increase quadratically, which will easily exceed the memory capacity of a computer. Fortunately, 
with Constraint 4, R is sparse and can be represented with a sparse matrix. For example, when 
p = 10,000, the full R matrix consumes over 100 gigabyte of memory, while the sparse matrix 
takes less than 1 megabyte. 

In addition, to update R in the M step, directly calculate its diagonal without calculating the 
full matrix R. 

Finally, in the E-step, a term Kt = +R)~^ involving R need to be calculated. 
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which involves the inverse of a large square matrix of dimension p by p. As stated previously, 
such a matrix exceeds available memory when p is high. The Woodbury Matrix Identity is 
employed to turn a high dimensional inverse to low dimensional problem: + R)~^ = 

R-^ - R-^C[{Vl-^)-^ + o'"R-^C]-^C^R-\ 

Also note that quantities like R~^ and R~^C can be pre-computed and reused throughout 
the E step. With the above three techniques, the EM algorithm can scale to very high dimen¬ 
sions in terms of p, d and T, without causing any computational issues. 


3.6 The Data 


The Kirby 21 data are resting-state fMRI scans consisting of a test-retest dataset previously 
acquired at the FM Kirby Research Center at the Kennedy Krieger Institute, Johns Hopkins 
University ( jLandman et al.[|20TT) t. Twenty-one healthy volunteers with no history of neurological 
disease each underwent two separate resting state fMRI sessions on the same scanner: a 
3T MR scanner utilizing a body coil with a 2D echoplanar (EPI) sequence and eight channel 
phased array SENSitivity Encoding (SENSE; factor of 2) with the following parameters: TR 2s; 
3mmx3mm in plane resolution; slice gap 1mm; and total imaging time of 7 minutes and 14 
seconds. 

The Human Connectome Project is a systematic effort to map macroscopic human brain cir¬ 


cuits and their relationship to behavior in a large population of healthy adults (Van Essen et al. 


201 3[ [Moeller et al.j , [201 0| : [Feinberg et aTj [201 0| . MR scanning includes four imaging modali¬ 
ties, acquired at high resolutions: structural MRI, resting-state fMRI (rfMRI), task fMRI (tfMRI), 
and diffusion MRI (dMRI). All 1200 subjects are scanned using all four of these modalities on a 
customized 3 T scanner. All scans consist of 1200 time points. 
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4 Results 


4.1 Parameter Estimation 

Two simulations of different dimensions are performed to demonstrate the model and its 
parameter estimations. 

In the low dimensional setting, p = 300, d = 10 and T = 100. A is first generated from 
a random matrix, then elements with small absolute values are then truncated such that 20 
percent of elements are zeros. After that a multiple of the identity matrix is added to A. A is 
then scaled to make sure its eigenvalues fall within [-1,1] to avoid diverging time series. Matrix 
C is generated as follows. Each column contains random samples from a standard Gaussian 
distribution. Then each column is sorted in ascending order. Covariance Q is the identity matrix 
and covariance R \s a multiple of the identity matrix. At time 0, a zero vector 0 is used as the 
value of xq. Pseudocode for data generation can be found in the Appendix on PLDS Git Repo. 

In the high-dimensional setting, p = 10000, d = 30 and T = 100. The parameter are gener¬ 
ated in a the same manner. 

To evaluate the accuracy of estimations, we elect to define the distance between two matri¬ 
ces A and B is defined as follows 


n 


d(A, B) = argmin log [— . „ „ . 

PeP(n) I ^Trace(P x Ca,b) 


(18) 


where Ca,b is the correlation matrix between columns of A and B, P{n) is the collection of all 
the permutation matrices of order n and P is a permutation matrix. 

As a result of the way it’s defined, d{A, B) is invariant to the scales of columns of A and B. 
It is also invariant to a permutation of columns of either matrix. The calculation of d(A, B) is 
exactly a linear assignment problem and can be solved in polynomial time with the Hungarian 
algorithm ( |Kuhn| , [T9^ . 

Both the generic LDS and the penalized LDS are applied to the simulation data. As the true 
parameters are sparse, we expect that the penalized algorithms would yield better estimations 
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Estimation Accuracy vs Penaity 


Estimation Accuracy vs Penalty 




loglO(Xc) Iog10(?ic) 

(a) Low dimensional setting (b) High dimensional setting 

Figure 1: x axis is tuning parameter Ac under log scale and y axis is the distance between 
truth and estimations; is increasing proportionally with Ac- One can see that in both the 
low dimensional and hight dimensional setting, estimation accuracies for A and C first increase 
then decrease as penalty increases. 

with appropriately chosen penalty parameters. When the penalties are approaching 0, the 
penalized algorithm should converge to the generic model. In addition, when the penalties are 
too big or too small compared to the optimal values, estimations might be less accurate. 

A sequence of tuning parameters Ac are utilized, ranging from 10"® to 10"^. Aa = kAc is set 
to increase proportionally with Ac, where A; is a constant. 

Estimation accuracies are plotted against penalty size Ac in Figure [j] Results from LDS 
and Mr. Sid are overlayed in one plot for comparison. As the figure shows, mr. sid converges 
to the LDS when the penalties are approaching zero. Estimation accuracies first increase with 
penalty size and then decrease due to over-shrinkage. 

As a concrete example, estimations from both methods are compared to the true values of 
parameters in Figurej^ One can see that true values in each column of C matrix are decreasing 
smoothly. which is estimated with optimal penalties Ac = A^ and Aa = kAm, shows similar 
pattern. In terms of A, the true value is sparse with many 0 (blue) values, mr. sid estimation 
Ax^ is also sparse, denoted by the off-diagonal 0 (blue) values. Flowever, LDS estimation Ax_^ 
is not sparse, with many positive (yellow and red) off-diagonal values. 
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Figure 2: Row 1: A truth; non-penalized estimation of A; optimally penalized estimation of A. 
Row 2: C truth; non-penalized estimation of C; optimally penalized estimation of C. 
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Table 2: mr. sid Running Time 


p 

100 

1000 

10000 

100000 

100000 

d 

10 

30 

50 

100 

500 

T 

100 

300 

500 

1000 

1000 

Time (min) 

0.07 

0.30 

5.15 

111.38 

1127.18 


In addition to the improved estimation accuracy, the proposed algorithm is also computa¬ 
tional efficiency and highly scalable. As a demonstration, we measure the running times of 
multiple simulation scenarios and summarize them in Table ^ When both p and d are high 
dimensional, the algorithm can still solve the problem in a reasonable time. 


4.2 Making Predictions 


Another perspective when considering mr. sid model is its ability to make predictions. 
When the parameters 9 and the latent states xt are estimated, one can first use estimated 
xt to predict xt+i and use xt+i to predict pt+i. Similarly, more predictions i/r+ 2 , • • • , 2 /T+fc can 


be made. Intuitively, properly chosen penalties give better estimations and good estimations 
should give more accurate predictions. This idea is demonstrated with a simulation. The pa¬ 


rameter settings for this simulation follow Section |4.1 The correlation between the predicted 
signal and true signal is used as a measure of prediction accuracy. The prediction accuracy 
over penalty size is shown in Figure 

From the plots one can see that the prediction accuracy first improves then drops when the 
penalties increase. The prediction accuracy peaks when the penalty coefficient Aa and Ac are 
around 10"^. This make sense as the same A pair also gives the best estimation for coefficients 
A and C, as in Figure Q] This latter observation provides us a way to pick tuning parameters in 
real applications, as detailed in Section [5| 
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Prediction/Estimation Accuracy vs Penalty 



Figure 3: Estimation and prediction accuracies. The x-axis is the penalty size under log scale, 
while y-axis is the estimation and prediction accuracies. One can see that the penalty that yields 
the most accurate estimation also gives the best predictions. 

5 Application 


When applied to fMRI data analysis, the model has very good interpretability. Each is a 
time step including the entire brain volume. Each column of the C matrix is interpreted as a 
time-invariant brain “point spread function”. At each time point, the observed brain image, y^, is 
a linear mixture of latent co-assemblies of neural activity xt. Matrix A describes how xt evolves 
over time. A can also be viewed as a directed graph if each neural assembly is treated as a 
vertex. Each neural assembly is spatially smooth and connectivities across them are empirically 
sparse. This naturally fits into the sparsity and smoothness assumptions in mr. sid. 

Mr. Sid is applied to analyze the motor cortex of human brains from the KIRBY 21 Data. 
In this application, test-retest scans from two subjects are analyzed. The imaging data are first 
preprocessed with FSL, a comprehensive library of analysis tools for fMRI, MRI and DTI brain 
imaging data dSmith et al.[[20M) . FSL is used for spatial smoothing with Gaussian kernel. Then 
Mr. Sid is applied on the smoothed data. 

The following are basic descriptions of the data and model parameters: number of voxels, 
p = 7396; Number of scans, T = 210; Number of latent states, d = 11. Tuning parameters: 
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Table 3: Similarities Among Estimated A Matrices 


d{-, •)(Amari Error) 

^11 

^12 

A 21 A 22 

^11 

0 



Ai2 

0.076(0.88) 

0 


A 21 

0.105(1.05) 

0.095(1.08) 

0 

A 22 

0.095(1.02) 

0.095(1.09) 

0.085(0.98) 0 


Aa = Ac = 10"®. Different combinations of Aa and Ac, where Aa = Ac is applied to the 
data. The values range from 10"^° to lO'^. Then the estimations are used to make predictions. 
The combination that gives the best predictions is used here. One can also use a grid of 
combinations, but it is time consuming. Max number of iterations for EM and the regularized 
subproblems are both 30 steps. 

A flexible method to choose the number of latent states involves the profile likelihood method 
proposed by Zhu et al. ( |Zhu and GhodsiH200^ . The method assumes eigenvalues of the data 
matrix come from a mixed Gaussian and use profile likelihood to pick the optimal number of 
latent states. Apply the method to all four scans, the numbers of latents states selected are 11, 
6, 14 and 15 respectively. Their average, d = 11, is used. 

Denote the A matrix estimation for the second scan of subject one as A^- Similar notations 
apply to the other scans. Then the similarities among the four matrices are summarized in 
Table 1^ The distance measure in Equation [1^ is used. Another permutation invariant measure 
of distance between two square matrices, the Amari error ( [Amari et^|1996^ , is also provided in 
the table. The Amari error between A and A: E{A,A) = + maitbfci 

i=l j=l * j=l i=l ^ 

where P = (pij) = A~^A. Notice a smaller d{A, B) or Amari error means more similarity. Among 
the six pairs from Au,Ai 2 , A 21 and A 22 , it is expected that the pairs {Aii,Ai 2 ) and {A 21 , A 22 ) give 
the smallest distances, as each pair comes from two scans of the same subject. This idea is 
validated by Table A direct application of this result is to correctly cluster the four scans into 
two group, each group corresponding to a subject. This implies that the A matrices contains 
subject-specific information. The similarities among A matrices are also shown in Figure as 
a heatmap. 
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Similarities Among Estimated A Matrices 



A,, A ,2 A21 A22 


Figure 4: Similarities among the four estimated A matrices. The distance d(-, ■) is used in this 
figure. As one can see, the two red/orange off-diagonal pixels has the minimum distances, 
which correspond to the pairs of (An, A 12 ) and (^ 21 ,^ 22 ) respectively. With this similarity map, 
one can tell which two scans are from the same subject. 


In addition, the 3D renderings of the columns of matrix C from the first scan of subject 
one are shown in Figure (after thresholding). It is helpful to compare those regions to other 
existing parcellations of the mortor cortex. As an example, the blue region in Figure[5]accurately 
matches the dorselmedical (DM) parcel of the five-region parcellation proposed by Nebel MB et 
2M4| . 


al. (Nebel et al. 



Figure 5: 3D rendering of columns of matrix C\ estimation from the first scan of subject one 
shown in this plot. 
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Another application of the algorithm is predicting brain signals. To demonstrate this, the 
algorithm is applied to the Human Connectome Project (HCP) data. Using the profile likeli¬ 
hood method, d = 149 is picked. The data has T = 1200 time points. The first N = 1000 are 
picked as training data, while the rest are used as test data. Then both the SVD method in 


Section [3^^ and mr. sid algorithm are used for estimations. Then fc-step ahead predictions 
are made with equations and estimations from both methods. Pseudocode for fc-step ahead 
predictions is given in Appendix on PLDS Git Repo. The prediction accuracies are shown in 
Figure 1^ (left panel). One can see mr. sid algorithm is giving significantly better predictions 
for the first 150 predictions compared to the SVD method. As the SVD method is also used to 
intialize mr. sid algorithm, this shows that mr. sid algorithm improves estimations from the 
SVD method in terms of short-term predictions. Another observation is, mr. sid algorithm’s 
performance get worse when one predicts into the “long” future (> 150 steps). This is rea¬ 
sonable because the prediction errors from each step will accumulate and yields deteriorating 
predictions as the number of steps increase. 



Figure 6: Prediction accuracies comparison on HCP data. Left: The mean squared error (MSB) 
is used as the accuracy measure. Right: Sample time series plot. The dotted green curve 
stands for the 60% confidence band given by mr. sid model. The true time series is aver¬ 
aged signals from a subsample of voxels. The predictions are also averaged over the same 
subsample. The confidence band is estimated based on the covariance matrix of these vox¬ 
els. A subsample of 20 voxels are picked in this experiment to avoid big covariance matrices 
calculation. All values are log-scaled for plotting purpose. 
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A sample plot of the true time series and predicted values are shown in Figure (right 
panel). We see that mr. sid is giving more accurate predictions and the true signal lies in the 
confidence band giving by mr. sid model. Another observation is that the confidence band is 
getting wider as we predict into the future, which is a result of the accumulated errors. 


6 Discussion 


By applying the proposed model to fMRI scans of the motor cortex of healthy adults, we 
identify limited sub-regions (networks) from the motor cortex. A statistical procedure should be 
further developed to match these regions to existing parcellations of the motor cortex. 

In the future, this work could be extended in two important directions. First, assumptions on 
the covariance structures in the observation equation could be generalized. Prior knowledge 
could be incorporated to covariance R ( [Allen et ar||2014| t. The general rule is that R should be 
general enough to be flexible while sufficiently restricted to make the model useful. A lot of other 
platforms such as tridiagnol and upper triangular could also be considered. Mohammad et al. 
have discussed the impact of auto correlation on functional connectivity, which also provides us 
a direction for extension ( jArbabshirani et al.|(2014j |. 

Finally, the work can also be extended on the application side. Currently, only data from a 
single subject is analyzed. As a next step, the model can be extended to a group version and be 
used to analyze more subjects. The coefficients from the algorithm could be used to measure 
the reproducibility of the scans. 
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Appendix 1 

Algorithm Standard Kalman Filter Smoother for estimating the moments 

required in the E-step of an EM algorithm for a linear dynamical system 
0. Define xl = E[xt\Yl),Yl = Var(a;t|Y[), xt = xj and A = + xjxj'' 

1. Forward Recursions: 

= AYlzl + Q 

Kt = + R)-^ 

+ Kt{yt - Cx\-^) 

V} = Vt^ - KtCVt^ 

= TTo, = Vo 

2. Backward Recursions: 

xj_^ = xlz\ + - A^li) 

Vl, = - Vt^)Jl, 

A,t-1 = aAi + 

= {I- KtC)AV^rI 
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Appendix 2 


In general, FISTA optimize a target function 

min F(x; A) = g(x) + A||x||i (19) 

xdX 

where g ; i?” -)■ i? is a continuously differentiable convex function and A > 0 is the regularization 
parameter. 

A FISTA algorithm with constant step is detailed below 


Algorithm FISTA(g, A). 

1 . Input an initial guess xq and Lipschitz constant L for Vg, set yi = xq, ti = 1 

2. Choose T e (0, l/L]. 

3. Set /c 0. 

4. loop 

5. Evaluate Vg(yk) 

6. Compute xi= S^A(yk - ^Vg(yk)) 

7. Compute 4+i = 

8. yk+i = Xk+(^^)(^Xk-Xk_i) 

9. Set /c ^ A; + 1 

10. end loop 


In the above 

SA(y) = (|y 


A)+sign(y) = 


y-\ if y > A 
y + \ if|/<-A 
0 if \y\ < A. 
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Appendix 3 


Algorithm fc-step predictions with PGA and mr. sid 


1. Denote the estimated parameters with PGA and mr. sid as Apca,Cpca,Apids, and 


3.4 


plds ■ 

2. PGA Estimated latent states att = 1000: xiooo, pea = column 1000 of Xdxx from Section 

3. Mr. Sid Estimated latent states att = 1000: xiooo,pz* is from E step in Section 

4. for i = 1 to k 

5. Xl000+k,pca ApQd 3Zg99_|_/j pea 

6. yi 000 +k,pca Cpca ^IGOO+Zc,pea 

■ ^1000+fc,p/ds -^plds ^999-\-k,plds 

8 - yi000-\-k^plds Cpidg ^1000+/c,p/t/s 

9. end 


3.3 


Appendix 4 


Algorithm Simulation Data Generation 

1. Denote the dimensions as p, d and T respectively 

2. Generate ap x d matrix Go from a standard Gaussian distribution 

3. Sort each column of Go in ascending order to get matrix G 

4. Generate adx d matrix Ao from a standard Gaussian distribution 

5. Add a multiple of the identity matrix to Aq 

6. Replace entries in Aq with small absolute values with 0 

7. Scale Ao to make sure its eigen values are between -1 and 1; use Ao as the A matrix 

8. Let Rbea diagonal matrix with positive diagonal entries and Q be the identity matrix 

9. Generate simulation data with A, G, Q and R 

10. end 
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